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Abstract 



We develop an extension of the classical Zellncr's g-prior to generalized linear mod- 
els. The prior on the hyperparameter g is handled in a flexible way, so that any 
continuous proper hyperprior /(<?) can be used, giving rise to a large class of hyper- 
g priors. Connections with the literature are described in detail. A fast and accurate 
integrated Laplace approximation of the marginal likelihood makes inference in large 
I/-} model spaces feasible. For posterior parameter estimation we propose an efficient 

and tuning-free Metropolis-Hastings sampler. The methodology is illustrated with 
variable selection and automatic covariate transformation in the Pima Indians dia- 
betes data set. 
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1. Introduction 



Assume that we have observed responses yi coming from a generalized linear model 



(GLM, see McCullagh and Nelder, 1989) incorporating the covariate vectors X{ G W via 
the linear predictors rji = (3q + xJ/3, i = 1, ... ,n. The response function (inverse link 
function) h transforms 77, to the mean E(yj) = Hi = h(r]i), which in turn is mapped to the 
canonical parameter 9{ = (db / 'dQ)~ l (//j) of the exponential family. Here db/d9 is the first 
derivative of the function b that defines the form of the likelihood for y = (y±, . . . , y n ) T 
via 



f(y\Po,f3)<xex P \J2 yA - b ^ \, (1) 



where each #j depends on the intercept /3q and the vector (3 of regression coefficients as 
described above. Often the canonical response function h = db/d9 is used which leads 
to the identity 9i = rji = (3q + xJ/3. The dispersions 4>i = 4>/ w i are assumed known and 
can incorporate weights u>«. The variance Var(yj) = <pid 2 b/d9 2 (9i) is expressed through 
the variance function v(fj,i) = d 2 b/d9 2 ((db/d9)~ 1 (fii)) as Var(yj) = cf>iv(fjLi). 

A Bayesian analysis starts by assigning prior distributions to the unknown model 
parameters /3q and (3. However, usually there is not only uncertainty with respect to 



the model parameters, but also to the model itself, see e.g. Clyde and George (2004). 
Let 7 be the model index contained in some model space T. Typically, the variable 
selection problem is considered, where 7 E {0, l} m contains binary inclusion indicators 
for all m available covariates. Here we think more generally of uncertainty about the 
form (including the dimension p 7 ) of the covariate vectors x™, which may also comprise 
different transformations of the original variables. For example, when 7 indicates a 
quadratic transformation of Xj, then x^i = (xi,xf) T . Thus, priors /(A))Ay|7) need 
to be assigned, for all models 7 G F. Manual elicitation of all these priors is clearly 
infeasible when V is large. In this situation priors which automatically derive from 7 are 
attractive, and we will propose such a prior in this paper. Model inference then uses the 
posterior model probabilities 

/(7 I y)«x/(y 1 7 )/(7), 7er, (2) 
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which combine the marginal likelihood 



f(y 1 7) 



Z(y|/3o,/3 7 , 7 )/(/3oAl7)W/3 n 



(3) 



with the prior model probabilities 7(7). 

In the special case of the classical normal linear model with known error variance <j> 
and Wi = l, the g-prior for the regression coefficients was proposed by Zellner (1986) as 



a "reference informative prior". It is the normal distribution with zero mean vector and 
covariance matrix g<f>(X'^ / X^)^ 1 , 



(4) 



and is usually combined with a locally uniform (Jeffreys) prior on (3q, assuming that the 
design matrix X 7 = (a: T i, . . . ,x in ) T has been centered to ensure X^l n = P7 . Often 
also the error variance 4> is assumed unknown and assigned a Jeffreys prior. The <?-prior 
can be interpreted as the conditional posterior of /3 7 given a locally uniform prior and 
an imaginary sample y = n from the normal linear model with original design matrix 
Xry and scaled error variance g<p. This implements the idea that after accounting for the 
mean level /?o not included in the g-prior, there is no difference between observations due 
to the covariates in X^ modelled through /3 . In addition to this nice interpretation, 
the g-prior has other advantages, such as invariance of the implied prior for the linear 



predictor under rescaling and translation of the covariates (Robert and Saleh, 1991 



p. 71), and automatic adaption to situations with near-collinearity between different 



covariates (Robert, 2001 p. 193). 

The hyperparameter g > in Q acts as an inverse prior sample size, and is thus 
very sensitive to prior elicitation. Larger values of g lead to preference of less complex 



models, a phenomenon known as the Lindley- Jeffreys paradox (Lindley, 1957 see also 



Robert, Chopin, and Rousseau, 2009, p. 161). Moreover, a fixed g does not allow the 



Bayes factor of a perfectly fitting model versus the null model go to infinity (Berger 



and Pericchi, 2001). Therefore, much research has been done in developing automatic 



specifications of g (George and Foster, 2000; Hansen and Yu, 2001 Fernandez, Ley 



and Steel, 2001 Cui and George, 2008). The multivariate Cauchy priors of Zellner 



and Siow (1980) correspond to fully Bayesian inference with an inverse-gamma prior 



for g. Unfortunately, the corresponding marginal likelihood f{y | 7) has no closed form. 
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Therefore Liang, Paulo, Molina, Clyde, and Berger (2008) proposed the hyper-g prior, 



which is a special case of the incomplete inverse-gamma prior by Cui and George ( 2008 ) . 
These hyperpriors retain a closed form expression for f(y | 7) which is vital for efficient 
model inference. 

In this article we develop an extension of the classical g-pviov Q to GLMs. The prior 
on the hyperparameter g is handled in a flexible way, so that any continuous proper 
hyperprior f(g) can be used. In Section [2j this generalized hyper-g prior is derived 
and connections with the literature are described. Because model inference is the main 
practical use of this automatic prior formulation, we will propose a fast and accurate 
numerical approximation of the marginal likelihood in Section [3j Section [3] also covers 
posterior parameter estimation with a tuning- free Markov chain Monte Carlo (MCMC) 
sampler. The methodology is applied to variable selection in Section [4] and to fractional 
polynomial modelling in Section [5j Section [6] discusses possibilities for future research. 



2. The generalized hyper-g prior 



2.1. Prior construction 



Consider the imaginary sample y = h(0)l n from the GLM with original design matrix 
X 7 and weights vector w = (wi, . . . , w n ) T , but scaled dispersion g<p. Using an improper 
flat prior for the regression coefficients vector /3 , its posterior given y is proportional 
to the likelihood 0, 



/(Ay I V 0,9,1) exp 



— h(0)wi9i - Wi b{ei) 



(5) 



This distribution can be recognized as the Chen and Ibrahim (2003, formula 2.6) prior, 
although the authors have only considered the case with Wi = 1. Similar to their the- 
orem 3.1, we can prove that the mode of this distribution is at (3 = P7 (see Ap- 



pendix |AJ). Moreover, it results from standard Bayesian asymptotic theory (e.g. Bern- 
ardo and Smith, 2000, p. 287) that this distribution converges for n — > 00 to the normal 



distribution 



(6) 
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Family 



Link 



Gaussian 



Poisson 



Bernoulli 



Gamma 

Inverse Gaussian 



Identity 1 

(Log) 1 

Log 1 

Identity (0) 

Logit 4 

Cauchit 7r 2 /4 

Probit vr/2 
Complementary log-log e — 1 

Log 1 

(Log) 1 



Table 1 - Exponential families, usual link functions and resulting factors c. Note that for the 
Gamma and the Inverse Gaussian family, the natural links /i -1 and /i~ 2 , respectively, 
cannot be used because then h(0) = oo. Parenthesized links should not be used 
because the uniqueness of the prior mode at (3 — p „ is not sure (Wedderburn 



1976). Parenthesized c's point out problems there. 



where c = v(h(0)) • dh/di](0)~ 2 and W = diag(tu), because the inverse of the expected 
Fisher information I(/3 7 ) evaluated at the mode is I(0 P7 ) _1 = g4>c(X^W (cf. 
Chen and Ibrahim, 2003, theorem 2.3). The "generalized g-prior" ^ differs from the 
standard g-prior Q only by the constant c and the weight matrix W. Both are especially 
important in binomial regression when w% is the sample size of the observed proportion, 
say yi = Si/wi if S{ ~ B'm(wi, Mi) is the number of successes: In Table [l] it can be seen 
that only for the Bernoulli family 1. 

Since the intercept /3q parametrizes the average linear predictor in each model, we can 
use the improper flat prior /(/?o) 1- Thus, our generalized g- prior does not shrink the 
intercept towards zero, while the prior on the regression coefficients again implements the 
non-informative prior idea that X 7 has o priori no effect on the centered observations. 
The factor g is assigned a (continuous) hyperprior f(g), which we treat generally in the 
paper. The hyperprior f(g) used in our approach must be proper to ensure that Bayes 
factor comparisons with the null model, which does not include the parameter g, are 
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valid. As g is assigned a hyperprior, we call the resulting prior a "generalized hyper-g 
prior". 



2.2. Comparison with the literature 



An immediate question is why we do not use the exact Chen and Ibrahim (2003) prior, 



which is also a generalization of the standard g-prior. The main problem with this 
conjugate prior given in ([5]) is that it does not have a closed form for non-normal expo- 
nential families, i. e. the normalizing constant of §5§ is unknown. This complicates the 



computation of the marginal likelihood and the MCMC sampling considerably. Chen, 



Huang, Ibrahim, and Kim (2008) propose a solution where they run an MCMC sampler 



on the full model, and then derive estimates for submodels. However, this approach is 
not applicable in problems with simultaneous variable selection and transformation as 
that presented in Section [5] because no full model exists in that case. Regarding the 



hyperparameter g, Chen and Ibrahim (2003) propose to assign an inverse-gamma prior 
to it. 



Alternatively, Gupta and Ibrahim (2009 ) proposed the information matrix prior, which 



uses the expected Fisher information matrix I(/3_,) similarly to a precision matrix for a 
normal distribution up to a scalar variance factor g: 



fGl(P y \g,1f) oc |/(/3 7 )| 1/2 exp 



1 

'25 



^/(/3 7 )/3 7 



(7) 



This will only be a Gaussian distribution if the matrix I(/3 7 ) actually does not depend 
on /3 7 , e. g. for the normal linear model where the standard g-prior is reproduced by ([7]). 
By contrast, the precision of our generalized g-prior in ^ results from evaluating I(/3 7 ) 



at the prior mode, producing a matrix which does not depend on (3 y . Gupta and Ibrahim 



(2009) fix the hyperparameter g at a "moderately large" value (g > 1) and do not consider 



inference for it. 

The information matrix prior is strongly linked with the unit information prior ap- 



proach of Kass and Wasserman (1995), who proposed the general idea that the amount 



of information in the prior on j3^ should be equal to the amount of information about 
it contained in one observational unit. The amount of information is measured by the 
(expected) Fisher information, so that the precision is chosen as n _1 /(0 P7 ) in the normal 
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prior 



W/3 7 I 9,7) = N P7 (/3 7 | P7 ,n/(0 P7 )~ 1 ) . 
This proposal is close to ours in ([6]), except that the hyperparameter is fixed at g 



(8) 



n. 



Note that Kass and Wasserman ( 1995 ) also required the nuisance parameter /3q to be null- 



orthogonal to the parameter of interest (3 , which we ensure by centering the covariates 



around zero. The unit information prior was used by Ntzoufras, Dellaportas, and Forster 



(2003) and Overstall and Forster (2010) in the GLM context. 



Hansen and Yu (2003, p. 156) also use the expected Fisher information, but evaluate 



it at the maximum likelihood (ML) estimate /3 7 to obtain a prior precision matrix: 



/ffy (/3 7 1*7,7) = N P7 /3 7 | P7 , 5 /(/3 7 ) 



(9) 



Hansen and Yuj find the dependence of their prior on the data y "hard to accept", although 
it can be interpreted as an empirical-Bayes approach. Also in this flavour, the authors 
maximize the (approximate) conditional marginal likelihood in order to eliminate g. 
Subsequent model selection is then based on a modified conditional marginal likelihood 
("minimum description length"). 
Instead of using the expected Fisher information matrix I(/3 ), 



Wang and George 



(2007) use the observed Fisher information matrix J(/3 7 ). While for canonical response 
functions the equality /(/3 7 ) = <7(/3 7 ) holds, in general J(/3 7 ) is different and depends 
on the observed response vector. Wang and George ( 2007| ) evaluate it at the original 
response y and the ML estimate to obtain the correlation structure of the normal dis- 
tribution: 



W/3 7 |9,7) = N P7 [p^Q^gJipj- 1 ] . (10) 
By comparison, our generalized g-prior ^ does not use the original data y, but only the 



design matrix X 7 . Analogously to Hansen and Yu ( 2003 ) , Wang and George ( 2007 ) select 



model-specific values for g by maximizing the conditional marginal likelihood f(y \ g, 7), 
but they also consider fully Bayesian inference for g. 



Marin and Robert (2007, p. 101) avoid the use of a Fisher information matrix alto- 



gether when they propose the "non-informative g-prior" 



W#yl<?,7) =N P7 {p^Op^giX^X^)- 1 ) 



(11) 
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for binary regression with probit and logit link functions. The factor g is assigned the 
improper prior f(g) oc g _3//4 , which can be regarded as a degenerate inverse-gamma 
distribution with shape —1/4 and scale 0. Note that using this improper hyperprior 
prohibits Bayes factor comparisons with the null model. 



3. Implementation 

3.1. Marginal likelihood computation 

Under the generalized hyper-^ prior, the marginal likelihood introduced in ([3]) is 

f(y 1 7) = j f(y\ A), /3 7 , 7) j /(Ay 1 g, i)f(g) dg W/3 7 



f(y\g,i)f(g)dg, (12) 

where the conditional marginal likelihood of the GLM 7 (given g) is 

/(Vl<7, 7)= j /(y|/?o,A 7 ,7)/(A 7 ls,7)WA r (13) 



Note that both (13) and (12) are only defined up to a constant, which we have fixed at 
unity, as we use the improper prior /(/?o) ^ 1- I n general, no closed form expressions 
are available. The obvious exception is the special case of normal likelihood, which was 
mentioned in Section [T] and will be referred to again later on in this section. Therefore, 
in order to be able to efficiently explore a large model space T, we need to develop a 
fast but accurate numerical approximation to the marginal likelihood. This will be a 



two-step procedure: The conditional marginal likelihood (13) is computed by a Laplace 



approximation. Plugging this into (12), the hyperparameter g will be integrated out 



with respect to its prior by numerical integration. Together, this is an integrated Laplace 



approximation (ILA), which was proposed more generally by Rue, Martino, and Chopin 



(2009). 

For ease of notation, denote by /3 07 = (A),/3 7 ) T the vector of all coefficients. Then 
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the Laplace approximation (Lindley, 1980 Tierney and Kadane, 1986) of (13) is 



f{v\9,i) 



f(P&j\v,9,i) 



f(y | /3* 07 , 7 ) (2tt#c) -v' 2 | I^X 7 1 1/2 exp { - - (g^p^X^WX^ X 



(2vr )(p+i)/2 \ R 



-1/2 



(14) 



when /(/3q 7 | y,g, 7) is the Gaussian approximation of the conditional coefficients pos- 
terior with mean vector /3g 7 and precision matrix -Rg 7 - Since the conditional coefficients 
prior can be seen to have a normal kernel /(/3 07 | g, 7) oc exp { — t^^-Rc^A^} with 
(singular) precision 



diag{0,(^c)- 1 X^X 7 }, 



(15) 



the Bayesian iterative weighted least squares (IWLS) algorithm (West, 1985. Gamerman 



1997) can be used to compute the moments of the Gaussian approximation. Note that 



this is different and potentially more accurate than the approach by Rue et al. (2009 



p. 327) who preserve the sparsity of the prior precision Rq^ in the resulting posterior 
precision Rq^. The accuracy of the Laplace approximation (14) can be even further 



improved by including higher-order terms of the underlying Taylor expansion. For ca- 



nonical response functions, Raudenbush, Yang, and Yosef (2000) derived a convenient 



correction factor corresponding to a sixth-order Laplace approximation. In the applic- 
ations of Sections |4] and [5j we have used this correction (see Appendix [B] for details), 
which clearly improved the ILA while requiring only slightly more computation time. 
The one-dimensional integration in (12) is performed on the log-scale over z = log (g) 



using Gauss-Hermite quadrature. First, we find the (approximate) posterior mode z* 
and variance a* 2 of z using its unnormalized (approximate) posterior density 



/(*,!/ 1 7) = f(y\z,7)f(z). 



(16) 



The mode z* is numerically determined by the optimize routine in R (R Development 
Core Team, 2010 Brent, 1973). The variance a* 2 can be computed as the negative 



inverse second derivative of the log posterior at z* by numerical differentiation (routine 
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df ridr from Press, Teukolsky, Vetterling, and Flannery, 2007, p. 231). Second, we apply 



the Gauss-Hermite quadrature (Naylor and Smith, 1982) 



N 



f(y\i) ~^2™>jf(zj,y\i), 



(17) 



where the actual weights rrij = ujj exp(t|)v2o"* and nodes Zj = z* + y2a*tj depend on 
z* , a* as well as original weights u)j and nodes tj, j = 1, . . . , N. These can be obtained 



from the Golub and Welsch (1969) algorithm, which is implemented in the R- function 
gauss. quad (Smyth, Hu, and Dunn 2009). N = 20 seems to be sufficient, given that 
this includes nodes in a range of about seven standard deviations around z* (as then 
\/2i20 ~ 7.6). Note that the Gauss-Hermite approximation in (17) is exact if f(z,y | 7) 
is the product of N(z \ z*,a* 2 ) and a polynomial of at most order 2N 



1. 



3.2. Metropolis- Hastings sampler 

Given a model 7 £ T we would like to sample from the joint posterior of the model-specific 
parameters # 7 = ((3q 1 ,z) t . To this end, we propose a tuning-free Metropolis-Hastings 
(MH) sampling scheme with proposal kernel 

q(0'^\0 J ) = q(l3' (h \z',l3 0l )q(z') (18) 

for the proposal # 7 given the current sample 7 . The independence proposal density 
q(z') is constructed by linearly interpolating pairs (zj, f(zj, y | 7)) and then normalizing 
this function to unity integral. Note that many pairs are already available from the op- 
timization and integration of ( |16| ) in the marginal likelihood computation. Thus, q(z) is 
close to the posterior density f(z | y,j), suggesting high acceptance rates of the sampler. 
Also, generating random variates from q(z) using inverse sampling is straightforward as 
the corresponding cumulative distribution function is piecewise quadratic. 

For the coefficients, g(/3'o7 I z ' \ A^) i s a Gaussian proposal density: Starting from the 
current vector /3 07 and the proposed prior covariance factor g' = exp(z'), a single step 
of the Bayesian IWLS is made, resulting in the mean vector and the precision matrix of 



the proposal (Gamerman, 1997). In order to compute the acceptance probability of the 



move from # 7 to 6' 



a{6' I O 



1 A 



/(y|/3 Or7 )/(0 7 |7) 9(0 7 |0; 



/(y|/3 O7 ,7)/(0 7 |7) q{0' y I #1 



(19) 
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note that the prior contributions have the form f(0 y | 7) = /(/3 7 | ff,7)/(ff)g, the last 
factor g being due to the change of variable z = log(g) in the proposal parametrization. 
For the reverse proposal kernel value q(0^ \ 0' ), another IWLS step starting from the 
proposed vector /3q and the current factor g = exp(z) is necessary. 

Besides producing parameter samples from the posterior /(# 7 | y, 7), the MH sampler 
can also be used to compute an MCMC estimate of the marginal likelihood /(1/I7), 
thereby providing an independent check of the numerical estimate presented in Sec- 
tion 3.1 We will use the method by Chib and Jeliazkov (2001 section 2.1), which was 



competitive in a review by Han and Carlin (2001) and is still a benchmark for new de- 



f(y I 7) 



(20) 



velopments (see e.g. Nott, Kohn, and Fielding, 2008). The estimate is based on the 
basic identity 

/te|g;,7)/(g;i7) 

f(o;\y,i) 

where 6* is usually chosen as a configuration with high posterior density which is fixed 
before the MCMC sampling. Then the detailed balance of the Markov chain ensures 
that the unknown posterior ordinate can be estimated by 



m\y,i) 



,,=i«(0;i07 



where the 0)3' are the posterior samples and the 6)^' are iid draws from the proposal 



(k) 



(21) 



distribution q(6^ \ 0*). Since each acceptance probability in (21) requires two additional 
IWLS steps, 4B additional IWLS steps are required for the Chib and Jeliazkov (2001) 
estimate if B posterior samples are used. 



3.3. Performance in the conjugate case 

For illustration of the performance of the proposed implementation, we consider the 
special case of normal linear regression with fixed error variance <f>. Using the g-prior Q, 
the conditional coefficients posterior is Gaussian, 

/(A) 7 I 2/, 9, 7) = N (A) I V, 0/n) N (/3 7 I g(g + g(g + 1)- V(* 7 X^ 1 ) , (22) 

where the ordinary least squares estimate /3 7 = (X^X 7 )~ 1 X^i/ is shrunk by the factor 
g(g + 1) _1 . Thus, the Laplace approximation (|14[) of the conditional marginal likelihood 
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is exact. It is given by 

f(y\g,7) = (g + l)- p -< /2 e W \(g + l)- 1 



SSR^ 

2(j) 



■ cxp 



SSE^ 



(23) 



where SSE^ and SSR^ are the error and regression sums of squares, respectively. From 



the form of (23) we see that an inverse-gamma prior IG(a, b) on g + 1 will be conjugate 



to this likelihood. Since g > must be ensured, this distribution must be truncated to 



(l,oo), yielding the incomplete inverse- gamma prior (Cui and George, 2008, p. 891) 

f{g) = M(a, b)(g + 1)"> +1 ) exp{-6/( 5 + 1)} (24) 



with the normalising constant 



M(a,b) 



b" 



Jo V" 1 exp(-i) dt 



(25) 



and the corresponding marginal likelihood 

M{a,b) 



f(y\i) 



exp 



SSE^, 



M(a 7 ,6 7 ) 
= a + p 7 /2 and 6 7 



20 



(26) 



SSR 1 /{24>) + b determine the 



where the updated parameters a 7 
posterior of g in model 7. 

In order to show results from a real data set, we consider the ozone data introduced by 



Breiman and Friedman (1985) in the notation of Sabanes Bove and Held (2010), where 



n = 330. Deciding whether to include each of the nine meteorological covariates zq and 
Z4, . . . , z\\ in the linear regression of the daily maximum ozone concentration y yields a 
model space T of size 2 9 = 512. For all 7 E T, the ILA (fT/l) and the MCMC estimate pjl) 



of the exact marginal likelihood value (26) were computed using the fixed (full-model 
variance estimate) (f> = 19.75 and the hyperprior parameters a = 0.01, b = 0.01. Figure[l] 
shows that the errors of ILA and MCMC are very small here compared to the absolute 
true values. 

For all models, the acceptance rates of the MH algorithm were above 97%. Figure [2] 
shows that even for the model with the lowest acceptance rate, the true posterior density 
of z = log(<7) is very close to its ILA estimate q(z). This explains the almost perfect 
acceptance rates of the MH scheme. 
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Figure 1 - Errors of ILA and MCMC estimates (y-axes) compared to the exact marginal like- 
lihood values (x-axes) for all 512 models. MCMC estimates are based on B = 4500 
samples which were saved after burn-ins of length 1000 (every 2nd iteration). Note 
that the marginal likelihood values include the additional constant factor y / 2n(f>/n 



compared to (26) 
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(a) Errors of the ILA estimates. 



(b) Errors of the MCMC estimates. The vertical bars 
represent 95% MCMC confidence intervals (cover- 
age is 95.1% here). 
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Figure 2 - True posterior density of z (solid line) compared with the TLA (dashed line) and 
MCMC (histogram) estimates. Small ticks above the horizontal axis indicate where 



nodes Zj for the construction of the TLA estimate q(z) were located (cf. Section 3.2) 




4. Variable selection 



We illustrate the methodology for non-normal data with the Pima Indians diabetes data 



set (Frank and Asuncion, 2010 Ripley, 1996), which contains n = 532 complete records 



on diabetes presence and m = 7 associated covariates described in Table [2j First, we 
restrict ourselves to variable selection in the logistic regression model, yielding a model 
space T of size 2 7 = 128. In Section § we will also consider power transformations of 
the covariates. 

Three different prior distributions for the covariance factor g are compared for a fully 
Bayesian analysis: 



Fl f(g) = lG(g | 1/2, n/2), corresponding to the Zellner and Siow (1980) approach; 
F2 f(g) = l/n(l + g/n)~ 2 , corresponding to the hyper-g/n prior (Liang et al. 2008 



p. 416); 

F3 f(g) = lG(g \ 0.001,0.001), which is a standard choice for variance parameters. 
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Variable Description 



y Signs of diabetes according to WHO criteria (Yes = 1, No = 0) 

x\ Number of pregnancies 

X2 Plasma glucose concentration in an oral glucose tolerance test [mg/dl] 

23 Diastolic blood pressure [mm Hg] 

24 Triceps skin fold thickness [mm] 

25 Body mass index (BMI) [kg/m 2 ] 
xq Diabetes pedigree function 

x-j Age [years] 



Table 2 - Description of the variables in the Pima Indians diabetes data set. 



We also consider model-specific empirical-Bayes estimation of g using the conditional 



marginal likelihood (13), abbreviating this approach as EB. Moreover, the standard 
criteria AIC and BIC are computed for each model. We use the prior model probabilities 

-1 



/(7) 



1 



m + 1 



(27) 



for an appropriate multiplicity adjustment (George and McCulloch, 1993 Scott and Ber- 



ger, 2010). Posterior model probabilities then follow from ([2]), where for EB the maxim- 



ized conditional marginal likelihood (13) and for BIC the approximation exp( — 1/2 BIC) 
(e.g. Kass and Raftery, 1995) is used instead of f(y\^f)- Similar model weights pro- 



portional to exp(— 1/2 AIC) can also be calculated for AIC as proposed by Buckland, 
Burnham, and Augustin ( |1997 ). 



In Table [3j the resulting posterior probabilities and AIC weights for variable inclusion 
are shown. All methods clearly select x±, X2, 25 and xq. The corresponding model is the 
maximum a posteriori (MAP) model in Fl, F2, F3 and BIC, while for EB and AIC also 
xj is included in the top model. This covariate would be included as well in the median 



probability model (Barbieri and Berger, 2004) for all methods except BIC. For 23 and 



24, the evidence for inclusion is consistently weak. For comparison, Holmes and Held 



(2006) used vague iid normal priors for all coefficients and a flat model prior 7(7) 
obtaining clear evidence for inclusion of the MAP covariates. 
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Fl F2 F3 EB AIC BIC 



X\ 


0.961 


0.965 


0.968 


0.970 


0.972 


0.946 


X2 


1.000 


1.000 


1.000 


1.000 


1.000 


1.000 


X3 


0.252 


0.309 


0.353 


0.384 


0.309 


0.100 


J'l 


0.248 


0.303 


0.346 


0.376 


0.296 


0.103 


X 5 


0.998 


0.998 


0.998 


0.998 


0.998 


0.997 


xq 


0.994 


0.995 


0.996 


0.996 


0.998 


0.987 


XI 


0.528 


0.586 


0.629 


0.659 


0.670 


0.334 



Table 3 - Posterior probabilities and AIC weights for variable inclusion in the Pima data. 



It is interesting that the inclusion probabilities under Fl, F2 and F3 are qualitatively 
similar. The reason could be that the sample size is relatively large in this example, 
reducing the importance of the hyperprior specification for g. For EB, most inclusion 
probabilities are even higher than for F3. The AIC weights are more similar to F2 
probabilities (except for x^). The BIC based probabilities are mostly lower, and close to 
the (not shown) probabilities under Fl when a flat model prior is used. 

While the posterior inclusion probabilities are visibly different for the six approaches, 
the model-averaged fits to the data are very close, as shown in Figure [3j In parallel 
to estimating the posterior parameter distributions leading to these fitted probabilities 
for Fl, F2, F3 and EB, we also estimated the marginal likelihood by MCMC. The 
resulting MCMC estimates were close to the ILA estimates, comparison plots looking 
like Figure [4] for F3. Note that the coverage of the MCMC confidence intervals is lower 
than in Figure [2b| because the ILA approximations are not exact. 



5. Fractional polynomials 

Fractional polynomials (FPs) are used for systematic power transformations of the cov- 



ariates x±, . . . ,x m (Royston and Altman, 1994). They widen the class of ordinary poly- 
nomials insofar as the powers are taken from the fixed set {—2, —1, —1/2, 0, 1/2, 1, 2, 3}, 
which also contains square roots, reciprocals and the logarithm by the |Box and Tidwell| 
(1962) convention x° = log(x). For each covariate Xk, at most two powers are chosen 
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Figure 3 - Model-averaged fitted probabilities in the Pima Indians variable selection example. 




i i i i i r 

100 200 300 400 500 
Individual index, ordered by Fl fit 



Figure 4 - Errors of ILA with respect to MCMC estimates of the marginal likelihood under 
F3, for all 128 models in the Pima Indians variable selection example. MCMC 
estimates are based on (at least) B = 5000 samples which were saved after burn- 
ins of length 1000 (every 2nd iteration). The vertical bars represent 95% MCMC 
confidence intervals (coverage is 72.7% here). 
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and collected in the tuple p k , while the corresponding coefficients are collected in the 
vector a k , determining the FP transform x^ k cx k . The special case p k \ = Pk2 is handled 
by multiplication with the logarithm, e.g. x k 2 ' 2 ^ = (x\, x\ log(xfc)) . Variable selection 
is embedded in this framework, because x k is not included in the model if p k = 0. 
Each model is thus uniquely identified by 7 = (p 1 , . . . ,p m ), the covariate vectors are 
Xyi = (x\l, . . . , x^) T and the vector of regression coefficients is /3 7 = (af , . . . , cxJ n ) T ■ 



Sabanes Bove and Held (2010) implemented Bayesian inference in normal linear FP 



models, and more details on FPs can be found in references therein. 

The model space V comprises 45 m models, and thus the use of an automatic prior for 
the parameter /3 , conditional on the model 7, is very attractive. The generalized g- 
prior ^ is automatic and only depends on the global hyperparameter g. We will again 
compare the three fully Bayesian approaches (Fl, F2, F3) with the empirical-Bayes 
procedure (EB) which were introduced in Section [4] and avoid manual specification of g. 
The prior model probabilities 7(7) = HfcLi f(Pk) depend on the prior FP transformation 
probabilities 

™-iCw i r 



which have the same form as ( |27| ): each degree \p k \ G {0,1,2} is equally probable, 
and all tuples p k of the same degree are equally probable. This implements Jeffreys 's 
"simplicity postulate" that simpler models must have greater prior probability than more 



complex models (Jeffreys, 1961, section 1.6), indeed the null model has the largest prior 
probability 3 _m . 

For the Pima data the model space F has size 45 7 ~ 3.7 • 10 , rendering an exhaust- 
ive evaluation of all 7 G T infeasible. Therefore we use an MCMC model composition 



(Madigan and York, 1995) approach: Starting from the null model, we move through T 
by successive slight modifications of the configuration 7. The modifications are accep- 
ted with MH acceptance probabilities, which ensures that models with higher posterior 



probability are more likely to be visited; see Sabanes Bove and Held (2010) for details. 
For all four approaches (Fl, F2, F3 and EB), we ran this model sampler for one million 
iterations. To get an idea of the computational complexity, note that on average 10.8 
(F2) and 22.1 (EB) models could be evaluated per second (on 2.8 GHz CPUs). All com- 
putations have been implemented in an R-package including an efficient C++ core for the 
MCMC parts, which is available from the first author. 
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Fl 


F2 


F3 


EB 


x\ 


0.119 


0.125 


0.135 


0.144 


X2 


1.000 


1.000 


1.000 


1.000 


X3 


0.050 


0.052 


0.054 


0.054 


X,{ 


0.032 


0.033 


0.033 


0.035 


x 5 


0.999 


0.999 


0.999 


0.999 


x 6 


0.992 


0.993 


0.993 


0.994 


x 7 


0.999 


0.999 


0.999 


0.999 



Table 4 - Posterior probabilities for variable inclusion in the Pima data when FP transform- 
ations are considered. The probabilities are based on 671525 (Fl), 719 929 (F2), 
758 616 (F3), and 777531 (EB) visited models. 

For all four approaches Table [4] shows clear evidence for the covariates x<i , X5 , xq and 
xj with posterior inclusion probabilities over 99%, while the other three covariates have 
probabilities below 15%. In comparison with the variable inclusion results for the un- 
transformed covariates in Table [3j it is interesting that x\ is no longer important when 
FP transformations are considered, while xj is much more important. 

In addition to examining the marginal inclusion probabilities, it is necessary to look at 
the transformations of the covariates. Since all four approaches produce similar variable 
inclusion probabilities and also share the MAP model x,- L = (x2i, x^ , x 6 ^ 2 , x^- 2 ) T , we 
only look in detail at the approach Fl (the three other producing again very similar 
results). In order to account for the model uncertainty, it is best to look at model- 
averaged estimates of variable transformations, conditional on variable inclusion. To 
this end we varied the transformation of one of the covariates X2,x$,xq,X7 while fixing 
the others at their MAP configuration. Averaging over the 44 models each then results in 
the panels in Figure [5j Plasma glucose concentration (^2) seems to have a strong positive 
linear association with diabetes log-odds, while the estimated positive effect of BMI (2:5) 
is levelling off non-linearly for (rare) high values and is weaker overall. Even smaller is 
the estimated positive effect of diabetes pedigree function (xq) with the largest increase 
in diabetes risk between x§ = 0.1 and x§ = 0.5. The remaining estimated association 
of age (27) is clearly non-linear, with higher diabetes risk for middle-aged participants. 
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These results are qualitatively similar to those obtained by Cottet, Kohn, and Nott 



(2008, p. 665) for a larger subset of the original Pima Indians data. 

The marginal posterior distributions for the covariance factor g differ slightly between 
the three prior choices Fl, F2 and F3. Averaging over the best 1000 models in terms 
of posterior probability which have been visited by the model sampler, we get the histo- 
grams (for z = log(g)) in Figure [6| The corresponding posterior means K(g \ y) decrease 
from 282.5 for Fl, 219.2 for F2 to 179.1 for F3, and this trend is also visible in the histo- 
grams. The results suggest a stronger prior shrinkage of the regression coefficients than 



that proposed by the unit information prior's fixed value g = n = 532 (cf. Section 2.2), 
as F(g <n\ y) ranges from 90.9% for Fl to 95.7% for F3. 

6. Discussion 

In this article, we presented a generalization of the g-prior to GLMs, which can be 
interpreted analogously to the classical g-prior for normal linear models. In our imple- 
mentation, the shrinkage-controlling hyperparameter g can be assigned any hyperprior, 
thus giving rise to a large class of generalized hyper-g priors. For mixtures of classical 



g-priors, Liang et al. (2008) could investigate theoretical model selection and prediction 
consistency properties. It would be desirable to also investigate such properties for our 
generalized hyper-<7 prior class. However, as fewer closed form expressions are available, 
derivation of comparable proofs will be more difficult in the GLM family. 

Another important area of future research is the thorough comparison of the general- 



ized hyper-g prior with the other approaches in the literature summarized in Section 2.2 
For example, exhaustive simulation studies could shed light on different performances of 
the priors in variable selection. Perhaps also theoretical results can be derived to explain 
the different properties of the approaches. 

Bayesian inference for FPs in GLMs was in fact the motivating application for this 
work. With huge model spaces to explore, the accurate numerical marginal likelihood 
approximation is vital for this and similar typical applications of the generalized hyper-g 
prior. Alternative MCMC estimates of the marginal likelihood were used to demonstrate 
the very good accuracy of the ILA. Yet, MCMC would not be suited for replacing the 
deterministic ILA approach in the stochastic model search, because the computation 
is slower by orders of magnitude and would require careful automatic monitoring of 
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Figure 5 - Model-averaged FP transformations of selected Pima Indians covariates under hy- 
pcrprior Fl. Means (solid lines), pointwise (dashed lines) as well as simultaneous 
(dotted lines) 95% credible intervals are given. Small ticks above the x-axes indicate 
data locations. 
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Figure 6 



— Comparison of marginal posteriors for z — log(g) under priors Fl, F2 and F3. The 
histograms are based on the model average over the respective 1000 models with 
highest posterior probability visited by the model samplers. 




convergence. Of course, the deterministic marginal likelihood approximation could be 
used for any type of stochastic model search, such as those recently proposed by |Hans, 



Dobra, and West (2007) and Dobra (2009). 



Finally, we note that the classical g-prior has recently been extended in other direc- 



tions as well. In the context of supervised machine learning, Zhang, Jordan, and Yeung 



(2008) replace X^X^ by a (possibly singular) kernel matrix and prove consistency 
properties for the normal linear model. |Maruyama and George (2008) remove the re- 
striction of j? 7 < n — 1 for normal linear models by working with the singular value 
decomposition (SVD) of the design matrix _X~ 7 . A similar extension is the "generalised 
singular g-prior" defined by West (2003) in the factor regression context. Along these 
lines, our generalized hyper-g prior could also be extended to the p-y > n case via the SVD 
W 1 I 2 X 1 = U-fD^Vry. We could just use the latent parameter <5 7 = V 7 /3 7 of reduced 



dimension k, 



7 



n 



1 instead of /3„ 



V^Sy. Defining the corresponding design matrix 



as Zy = W l ' 2 U 1 D 1 , we have Xy/3 y = Z 1 8 1 and retain Z^l n = 0^. Assigning 



the prior distribution <5 7 ~ N^ 7 (Ofc 7 , g4>cD~ ) then induces a normal prior on /3 7 with 
mean zero and singular precision (g<f)c)~ 1 X'^WX 1 , and thus directly generalizes ([6]). 
Investigation of this approach for GLMs with many covariates is another possibility for 
future research. 
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A. Proof of prior mode zero 



Consider the density function from ([5]). Dropping for brevity the notational dependency 
on the model 7, it can be rewritten as 



f(J3 I g, y Q ) oc exp | ±w T (h(0)O - b(0)) j , 



(29) 



where = . . . , 9 n ) T and 6(0) = (&(0i), . . . , &(0„)) ■ To prove that the mode is at 
(3 = P , note that this is a solution of the score equation 







log f(J3 1 g, y ) 



1 



h(0) 



80 db{0) 86 



T 



Qp-ciM-,*,™ g( f)\-^'d(3 T dO T d(3 T 
because (3 = P implies that b'(0i) = b'(9) = (jl = h(0) and hence 

db(8 



w = 0„, 



86 



T 



diag^fli),..., &'(#„)) =h(0)I r 



B. Higher-order Laplace approximation 



Denote the standard Laplace approximation (14) by fhA{y\g,l)- Then 



Raudenbush 



et al. (2000 p. 148) show that 



f(y\g,i) ~ fhA{y\g,i) 



n n 

i- l y dfhj - - y dfhl + -k T {my l k 

g Z-^/ % 1 4g 1 1 24 7 



(30) 



i=l i=l 

is a sixth-order Laplace approximation when the canonical response function is used. 



Here d. 



(m) 
;(2) 



d m h/dr] m (r]*) evaluated at rj* = x^/3*^, h 



Xq^^Rq^) 1 xo 1 i and k 



Yli=l d\ h&Oyi- Note that the quadratic forms can be efficiently computed using the 
Cholesky decomposition i?Q 7 = LL T , e.g. k T (R*Q 1 )~ 1 k = \\v\\ 2 where Lv = k. 
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